<!--begin.rcode results='hide', echo=FALSE, message=FALSE
library(caret)
data(BloodBrain)
library(earth)
data(etitanic)

hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
  if (is.character(x)) highr::hi_html(x) else hook_inline(x)
  })
opts_chunk$set(comment=NA)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
theme1 <- caretTheme()
theme1$superpose.symbol$col = c(rgb(1, 0, 0, .4), rgb(0, 0, 1, .4), 
  rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.symbol$pch = c(15, 16, 17)
theme1$superpose.line$col = c(rgb(1, 0, 0, .9), rgb(0, 0, 1, .9), rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.line$lwd <- 2
theme1$superpose.line$lty = 1:3
theme1$plot.symbol$col = c(rgb(.2, .2, .2, .4))
theme1$plot.symbol$pch = 16
theme1$plot.line$col = c(rgb(1, 0, 0, .7))
theme1$plot.line$lwd <- 2
theme1$plot.line$lty = 1

    end.rcode-->


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
    Design by Free CSS Templates
    http://www.freecsstemplates.org
    Released for free under a Creative Commons Attribution 2.5 License

    Name       : Emerald 
    Description: A two-column, fixed-width design with dark color scheme.
    Version    : 1.0
    Released   : 20120902

  -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="keywords" content="" />
    <meta name="description" content="" />
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <title>Pre-Processing</title>
    <link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
    <link href="style.css" rel="stylesheet" type="text/css" media="screen" />
  </head>
  <body>
    <div id="wrapper">
      <div id="header-wrapper" class="container">
	<div id="header" class="container">
	  <div id="logo">
	    <h1><a href="#">Pre-Processing</a></h1>
	  </div>
          <!--
	      <div id="menu">
		<ul>
		  <li class="current_page_item"><a href="#">Homepage</a></li>
		  <li><a href="#">Blog</a></li>
		  <li><a href="#">Photos</a></li>
		  <li><a href="#">About</a></li>
		  <li><a href="#">Contact</a></li>
		</ul>
	      </div>
              -->
	</div>
	<div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
      </div>
      <!-- end #header -->
      <div id="page">
	<div id="content">
<h1>Contents</h1>  
<ul>
  <li><a href="#dummy">Creating Dummy Variables</a></li>
  <li><a href="#nzv">Zero- and Near Zero-Variance Predictors</a></li>
  <li><a href="#corr">Identifying Correlated Predictors</a></li>
  <li><a href="#lindep">Linear Dependencies</a></li>
  <ul>
    <li><a href="#pp">The <span class="mx funCall">preProcess</span> Function</a></li>
    <li><a href="#cs">Centering and Scaling</a></li>
    <li><a href="#impute">Imputation</a></li>
    <li><a href="#trans">Transforming Predictors</a></li>
  </ul>
  <li><a href="#all">Putting It All Together</a></li>  
  <li><a href="#cent">Class Distance Calculations</a></li>  
 </ul>   
  



<p>
<a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> includes several functions to pre-process the predictor data.  Assumes that all of the data are numeric (i.e. factors have been converted to dummy variables via <span class="mx funCall">model.matrix</span>, <span class="mx funCall">dummyVars</span> or other means). 
</p>
          
          
 <div id="dummy"></div>         
 <h1>Creating Dummy Variables</h1>
 <p>
 The function <span class="mx funCall">dummyVars</span>  can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method.
 </p>
<p>
For example, the <code>etitanic</code> data set in the
<a href="http://cran.r-project.org/web/packages/earth/index.html"><strong>earth</strong></a> package includes two factors:
<code>pclass1</code> (with levels 
<!--rinline I(paste(levels(etitanic$pclass),  collapse = ", ")) -->) and <code>sex</code> (with levels <!--rinline I(paste(levels(etitanic$sex),  sep = "", collapse = ", ")) -->). The base R function <span class="mx funCall">model.matrix</span> would generate the following variables:
</p>
<!--begin.rcode dummy1
library(earth)
data(etitanic)
head(model.matrix(survived ~ ., data = etitanic))
    end.rcode-->
<p>
Using <span class="mx funCall">dummyVars</span>:
</p>
<!--begin.rcode dummy2
dummies <- dummyVars(survived ~ ., data = etitanic)
head(predict(dummies, newdata = etitanic))
    end.rcode-->
<p>
Note there is no intercept and each factor has a dummy variable for each level, so this parameterization may not be useful for some model functions, such as <span class="mx funCall">lm</span>.
</p>

 <div id="nzv"></div>   
<h1>Zero- and Near Zero-Variance Predictors</h1>
<p>
In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a "zero-variance predictor"). For many models (excluding tree-based models), this may cause the model to crash or the fit to be unstable. 
</p>
<p>
Similarly, predictors might have only a handful of unique values that occur with very low frequencies. For example, in the drug resistance data, the <code>nR11</code> descriptor (number of 11-membered rings) data have a few unique numeric values that are highly unbalanced:
</p>

<!--begin.rcode nzv1
data(mdrr)
data.frame(table(mdrrDescr$nR11))
    end.rcode-->

<p>
The concern here that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples or that a few samples may have an undue influence on the model. These "near-zero-variance" predictors may need to be identified and eliminated prior to modeling.
</p>
<p>
To identify these types of predictors, the following two metrics can be calculated:
</p>
<ul>
  		<li> the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio''), which would be near one for well-behaved predictors and very large for highly-unbalanced data></li>
<li> the "percent of unique values'' is the number of unique values divided by the total number of samples (times 100) that approaches zero as the granularity of the data increases></li>
</ul>

<p>
If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance.
</p>
<p>
We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution. Using both criteria should not falsely detect such predictors.
</p>
<p>
Looking at the MDRR data, the <span class="mx funCall">nearZeroVar</span> function can be used to identify near zero-variance variables (the <span class="mx arg">saveMetrics</span> argument can be used to show the details and usually defaults to <code>FALSE</code>):
</p>

<!--begin.rcode nzv2
nzv <- nearZeroVar(mdrrDescr, saveMetrics= TRUE)
nzv[nzv$nzv,][1:10,]
dim(mdrrDescr)
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
dim(filteredDescr)
    end.rcode-->
    
<p>
By default, <span class="mx funCall">nearZeroVar</span> will return the positions of the variables that are flagged to be problematic.
</p>

 <div id="corr"></div> 
<h1>Identifying Correlated Predictors</h1>

<p>
While there are some models that thrive on correlated predictors (such as <span class="mx funCall">pls</span>), other models may benefit from reducing the level of correlation between the predictors. 
</p>

<p>
Given a correlation matrix, the <span class="mx funCall">findCorrelation</span> function uses the following algorithm to flag predictors for removal:
</p>


<!--begin.rcode corr1
descrCor <-  cor(filteredDescr)
highCorr <- sum(abs(descrCor[upper.tri(descrCor)]) > .999)
    end.rcode-->

<p>
For the previous MDRR data, there are <!--rinline I(highCorr) --> descriptors that are almost perfectly correlated (|correlation| &gt; 0.999), such as the total information index of atomic composition (<code>IAC</code>) and the total information content index (neighborhood symmetry of 0-order) (<code>TIC0</code>) (correlation = 1). The code chunk below shows the effect of removing descriptors with absolute correlations above 0.75.
</p>


<!--begin.rcode corr2
descrCor <- cor(filteredDescr)
summary(descrCor[upper.tri(descrCor)])

highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
filteredDescr <- filteredDescr[,-highlyCorDescr]
descrCor2 <- cor(filteredDescr)
summary(descrCor2[upper.tri(descrCor2)])
    end.rcode-->

 <div id="lindep"></div> 
<h1>Linear Dependencies</h1>

<p>
The function <span class="mx funCall">findLinearCombos</span> uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist). For example, consider the following matrix that is could have been produced by a less-than-full-rank parameterizations of a two-way experimental layout:
</p>

<!--begin.rcode ld1
ltfrDesign <- matrix(0, nrow=6, ncol=6)
ltfrDesign[,1] <- c(1, 1, 1, 1, 1, 1)
ltfrDesign[,2] <- c(1, 1, 1, 0, 0, 0)
ltfrDesign[,3] <- c(0, 0, 0, 1, 1, 1)
ltfrDesign[,4] <- c(1, 0, 0, 1, 0, 0)
ltfrDesign[,5] <- c(0, 1, 0, 0, 1, 0)
ltfrDesign[,6] <- c(0, 0, 1, 0, 0, 1)
    end.rcode-->

<p>
Note that columns two and three add up to the first column. Similarly, columns four, five and six add up the first column. <span class="mx funCall">findLinearCombos</span> will return a list that enumerates these dependencies. For each linear combination, it will incrementally remove columns from the matrix and test to see if the dependencies have been resolved. <span class="mx funCall">findLinearCombos</span> will also return a vector of column positions can be removed to eliminate the linear dependencies:
</p>


<!--begin.rcode ld2
comboInfo <- findLinearCombos(ltfrDesign)
comboInfo

ltfrDesign[, -comboInfo$remove]
    end.rcode--> 

<p>
These types of dependencies can arise when large numbers of binary chemical fingerprints are used to describe the structure of a molecule.
</p>


<div id="pp"></div> 
<h1>The <span class="mx funCall">preProcess</span> Function</h1>


<p>
The <span class="mx funCall">preProcess</span> class can be used for many operations on predictors, including centering and scaling. The function <span class="mx funCall">preProcess</span> estimates the required parameters for each operation and <span class="mx funCall">predict.preProcess</span> is used to apply them to specific data sets. This function can also be interfaces when calling the <span class="mx funCall">train</span> function. 
</p>

<p>
Several types of techniques are described in the next few sections and then another example is used to demonstrate how multiple methods can be used. Note that, in all cases, the <span class="mx funCall">preProcess</span>  function estimates whatever it requires from a specific data set (e.g. the training set) and then applies these transformations to <i>any</i> data set without recomputing the values
</p>

<div id="cs"></div> 
<h2>Centering and Scaling</h2>

<p>
In the example below, the half of the MDRR data are used to estimate the location and scale of the predictors. The function <span class="mx funCall">preProcess</span> doesn't actually pre-process the data. <span class="mx funCall">predict.preProcess</span> is used to pre-process this and other data sets.
</p>

<!--begin.rcode cs
set.seed(96)
inTrain <- sample(seq(along = mdrrClass), length(mdrrClass)/2)

training <- filteredDescr[inTrain,]
test <- filteredDescr[-inTrain,]
trainMDRR <- mdrrClass[inTrain]
testMDRR <- mdrrClass[-inTrain]

preProcValues <- preProcess(training, method = c("center", "scale"))

trainTransformed <- predict(preProcValues, training)
testTransformed <- predict(preProcValues, test)
    end.rcode-->
    
<p>
The <span class="mx funCall">preProcess</span> option <span class="mx arg">"ranges"</span> scales the data to the interval [0, 1].
</p>

<div id="impute"></div> 
<h2>Imputation</h2>

<p>
<span class="mx funCall">preProcess</span> can be used to impute data sets based only on information in the training set. One method of doing this is with K-nearest neighbors. For an arbitrary sample, the K closest neighbors are found in the training set and the value for the  predictor is imputed using these values (e.g. using the mean). Using this approach will automatically trigger <span class="mx funCall">preProcess</span> to center and scale the data, regardless of what is in the <span class="mx arg">method</span> argument. Alternatively, bagged trees can also be used to impute. For each predictor in the data, a bagged tree is created using all of the other predictors in the training set. When a new sample has a missing predictor value, the bagged model is used to predict the value. While, in theory, this is a more powerful method of imputing, the computational costs are much higher than the nearest neighbor technique.
</p>

<div id="trans"></div> 
<h2>Transforming Predictors</h2>


<p>In some cases, there is a need to use principal component analysis (PCA) to transform the data to a smaller sub&ndash;space where the new variable are uncorrelated with one another. The <span class="mx funCall">preProcess</span> class can apply this transformation by including <code>&quot;pca&quot;</code> in the <span class="mx arg">method</span> argument. Doing this will also force scaling of the predictors. Note that when PCA is requested, <span class="mx funCall">predict.preProcess</span> changes the column names to <code>PC1</code>,  <code>PC2</code> and so on.</p>

<p>Similarly, independent component analysis (ICA) can also be used to find new variables that are linear combinations of the original set such that the components are independent (as opposed to uncorrelated in PCA). The new variables will be labeled as  <code>IC1</code>,  <code>IC2</code> and so on.</p>

<p>The &quot;spatial sign&rdquo; transformation (<a
href="http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2006/46/i03/abs/ci050498u.html">Serneels
et al, 2006</a>) projects the data for a predictor to the unit circle
in p dimensions, where p is the number of predictors. Essentially, a
vector of data is divided by its norm. The two figures below show two
centered and scaled descriptors from the MDRR data before and after
the spatial sign transformation. The predictors should be centered and
scaled before applying this transformation.</p>

<!--begin.rcode set1
library(AppliedPredictiveModeling)
transparentTheme(trans = .4)
    end.rcode-->  
    
<!--begin.rcode pp_SpatSignBefore, tidy=FALSE, fig.height=5, fig.width=5 
plotSubset <- data.frame(scale(mdrrDescr[, c("nC", "X4v")])) 
xyplot(nC ~ X4v,
       data = plotSubset,
       groups = mdrrClass, 
       auto.key = list(columns = 2))  
    end.rcode-->  


<p>After the spatial sign:</p>

<!--begin.rcode pp_SpatSignAfter, tidy=FALSE, fig.height=5, fig.width=5 
transformed <- spatialSign(plotSubset)
transformed <- as.data.frame(transformed)
xyplot(nC ~ X4v, 
       data = transformed, 
       groups = mdrrClass, 
       auto.key = list(columns = 2)) 
    end.rcode-->  

<p>Another option, <span class="mx arg">&quot;BoxCox&quot;</span> will estimate a Box&ndash;Cox transformation on the predictors if the data are greater than zero.</p>


<!--begin.rcode bc1
preProcValues2 <- preProcess(training, method = "BoxCox")
trainBC <- predict(preProcValues2, training)
testBC <- predict(preProcValues2, test)
preProcValues2
    end.rcode-->

<p>The <code>NA</code> values correspond to the predictors that could
not be transformed. This transformation requires the data to be greater than zero. Two similar transformations, the Yeo-Johnson and exponential transformation of Manly (1976) can also be used in <span class="mx funCall">preProcess</span>.
</p>

<div id="all"></div> 
<h1>Putting It All Together</h1>
<p>
In <i>Applied Predictive Modeling</i> there is a case study where the execution times of jobs in a high performance computing environment are being predicted. The data are:
</p>

<!--begin.rcode hpc
library(AppliedPredictiveModeling)
data(schedulingData)
str(schedulingData)
    end.rcode-->

<p>
The data are a mix of categorical and numeric predictors. Suppose we want to use the Yeo-Johnson transformation on the continuous predictors then center and scale them. Let's also suppose that we will be running a tree-based models so we might want to keep the factors as factors (as opposed to creating dummy variables). We run the function on all the columns except the last, which is the outcome. 
</p>    
    
<!--begin.rcode hpc_pp_factors
pp_hpc <- preProcess(schedulingData[, -8], 
                     method = c("center", "scale", "YeoJohnson"))
pp_hpc
transformed <- predict(pp_hpc, newdata = schedulingData[, -8])
head(transformed)
    end.rcode-->    
    
<p>
The two predictors labeled as "ignored" in the output are the two factor predictors. These are not altered but the numeric predictors are transformed. However, the predictor for the number of pending jobs, has a very sparse and unbalanced distribution:
</p>       
 
<!--begin.rcode hpc_nzv
mean(schedulingData$NumPending == 0)
    end.rcode-->  
    
<p>
For some other models, this might be an issue (especially if we resample or down-sample the data). We can add a filter to check for zero- or near zero-variance predictors prior to running the pre-processing calculations:
</p>       
  
<!--begin.rcode hpc_pp_nzv
pp_no_nzv <- preProcess(schedulingData[, -8], 
                        method = c("center", "scale", "YeoJohnson", "nzv"))
pp_no_nzv
predict(pp_no_nzv, newdata = schedulingData[1:6, -8])
    end.rcode-->  
    
<p>
Note that one predictor is labeled as "removed" and the processed data lack the sparse predictor. 
</p>      
    
    
<div id="cent"></div> 
<h1>Class Distance Calculations</h1>

<p>
<a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> contains functions to generate new predictors variables based on distances to class centroids (similar to how linear discriminant analysis works). For each level of a factor variable, the class centroid and covariance matrix is calculated. For new samples, the Mahalanobis distance to each of the class centroids is computed and can be used as an additional predictor. This can be helpful for non-linear models when the true decision boundary is actually linear.
</p>
<p>
In cases where there are more predictors within a class than samples, the <span class="mx funCall">classDist</span> function has arguments called <span class="mx arg">pca</span> and <span class="mx arg">keep</span> arguments that allow for principal components analysis within each class to be used to avoid issues with singular covariance matrices. 
</p>
<p>
<span class="mx funCall">predict.classDist</span> is then used to generate the class distances. By default, the distances are logged, but this can be changed via the <span class="mx arg">trans</span> argument to <span class="mx funCall">predict.classDist</span>.
</p>
<p>
As an example, we can used the MDRR data.
</p>
<!--begin.rcode cd1
centroids <- classDist(trainBC, trainMDRR)
distances <- predict(centroids, testBC)
distances <- as.data.frame(distances)
head(distances)
    end.rcode-->
<p>
This image shows a scatterplot matrix of the class distances for the held-out samples:
</p>

  
<!--begin.rcode pp_splom, tidy=FALSE, fig.height=5, fig.width=5 
xyplot(dist.Active ~ dist.Inactive,
       data = distances, 
       groups = testMDRR, 
       auto.key = list(columns = 2))
    end.rcode--> 

	  <div style="clear: both;">&nbsp;</div>
	</div>
	<!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a><li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
	<div style="clear: both;">&nbsp;</div>
      </div>
      <div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
      <!-- end #page -->
    </div>
    <div id="footer-content"></div>
<!--begin.rcode echo = FALSE
knit_hooks$set(inline = hook_inline)    
    end.rcode--> 
 
    <div id="footer">
      <p>Created on <!--rinline I(session) -->.</p>
    </div>
    <!-- end #footer -->
  </body>
</html>
